Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloning refactor and data race fix #393

Merged
merged 16 commits into from
Nov 15, 2023
Merged

Conversation

TwFlem
Copy link
Contributor

@TwFlem TwFlem commented Nov 4, 2023

Changes in this PR

This is an attempt to start refactor cloning for issue #367. This will also probably address issue #276.

One thing that I noticed in the clone package was all of the concurrency happening underneath golden gate. This made it a bit harder to grock what was going on. This also messed with the correctness of the program- hence #276. Perhaps more importantly, the unbounded number of go routines spinning up for pure, unblocking compute causes performance problems. So many go routines get created that the program begins spending most of its time thrashing back and forth between goroutines rather than just computing.

Here's some cpu profiles.

This is the cpu profile with the same benchmark in the changes in this PR with the concurrency that's currently in master:
cpu original-concurrenty-10s

This is just with the currency removed:
cpu no-concurrency-10s

You'll see that with concurrency removed, more time is spent actually running golden gate rather than coordinating goroutines and gc time. In fact, the bottleneck looks to be seqhash. There is also a large amount of time on the GC. GC time could probably be improved by taking an iterative approach- that way there isn't too much pressure on the groutine's stack to keep track of everything.

As far as what to make concurrent- I don't know enough about Poly of the field and how this actually gets used. If I had to throw out suggestions on where to best add some parallelism, it might be parallelizing multiple runs of GoldenGate or parallelize that for loop that calls to fan out the initial call to recurseLigate for each fragment and maybe synchronize updating whatever will keep track of existing seqhashes. Some discussion and experimentation could go a long ways here.

Also it could be super valuable to optimize seqhash 🤷‍♂️

This PR also introduces the concept of EnzymeManager. The idea here is to abstract away that global enzyme map state so consumers could have a straight forward way of managing their enzyme data throughout the lifetime of the program. For example, they could add, remove, reorder, whatever Poly would want to have EnzymeManager implement. The main impetus for this was the comment

Eventually, we want to get the data for this map from ftp://ftp.neb.com/pub/rebase

Something like the EnzymeManager could make that flexibility possible and allow consumers to load whatever data they want from wherever they want.

Honestly though, we can throw the EnzymeManager out- the other changes seem much more important.

Why are you making these changes?

starts resolving #367 and will resolve #276.

Are any changes breaking? (IMPORTANT)

The EnzymeManager related changes are breaking. This isn't important for addressing #367 and #276- this can be removed if need be.

Pre-merge checklist

All of these must be satisfied before this PR is considered
ready for merging. Mergeable PRs will be prioritized for review.

  • [x ] New packages/exported functions have docstrings.
  • [x ] New/changed functionality is thoroughly tested. (the existing tests pass)
  • [] New/changed functionality has a function giving an example of its usage in the associated test file. See primers/primers_test.go for what this might look like.
  • Changes are documented in CHANGELOG.md in the [Unreleased] section.
  • [x ] All code is properly formatted and linted.
  • [x ] The PR template is filled out.

@Koeng101
Copy link
Contributor

Koeng101 commented Nov 4, 2023

I was the original one to code clone, back when I wasn't very good at Go. I am sorry. Thank you for doing this.

Here's some cpu profiles.

I need zero convincing to change this out of concurrency, lol

Also it could be super valuable to optimize seqhash

I'm interested: how so? I'm not sure how we can check for equivalence, since boothLeastRotation is handling the circular sequences, where most time is spent.

taking an iterative approach- that way there isn't too much pressure on the groutine's stack to keep track of everything.

This makes sense to me. I've always thought an iterative approach would be better.

As far as what to make concurrent- I don't know enough about Poly of the field and how this actually gets used

This is somewhere I can help! I think I am the main user of this functionality, though I hope more folks will use it soon (and I use it to back some easier-to-use APIs for people). So, the idea is that I can use this for simulating GoldenGate assemblies that I am doing in the lab.

The thing was, I made this functionality to truly model even the most complex ligations. _I've literally never found someone who did these complexity ligations. So it solves a problem that doesn't really exist. Really what is needed is for the function to only simulate ligation with ordered pairs of overhangs (so just checks that no two fragments share overhangs, then line them up, and put em together). Far simpler, and no need to even call seqhash.

The most complex use cases for combinatorial assembly can then simply iterate through each "part type" and call the function. The most-most complex use case I built for (combinatorial and variant overhang fragments) should IMO just be removed. Nuke the code [not a request in this PR, but something I'd like to eventually do].

Something like the EnzymeManager could make that flexibility possible and allow consumers to load whatever data they want from wherever they want.

I like the EnzymeManager.

Reviewing now.

Copy link
Contributor

@Koeng101 Koeng101 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One change request, which actually changes the usage of the package a little but is a feature I'd actually like to implement.

I like the changes though, and I think this is good for a merge.

Comment on lines 11 to 18
// Eventually, we want to get the data for this map from ftp://ftp.neb.com/pub/rebase
func getBaseRestrictionEnzymes() []Enzyme {
return []Enzyme{
{"BsaI", regexp.MustCompile("GGTCTC"), regexp.MustCompile("GAGACC"), 1, 4, "GGTCTC"},
{"BbsI", regexp.MustCompile("GAAGAC"), regexp.MustCompile("GTCTTC"), 2, 4, "GAAGAC"},
{"BtgZI", regexp.MustCompile("GCGATG"), regexp.MustCompile("CATCGC"), 10, 4, "GCGATG"},
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should expose this to consumers. The number of enzymes actually used in GoldenGates is super limited. Tbh, I think the one change would be to have these as a nice default that users can use, as well as adding the following enzymes:

    "PaqCI": {"PaqCI", regexp.MustCompile("CACCTGC"), regexp.MustCompile("GCAGGTG"), 4, 4, "CACCTGC"},
    "BsmBI": {"BsmBI", regexp.MustCompile("CGTCTC"), regexp.MustCompile("GAGACG"), 1, 4, "CGTCTC"},
    "SapI": {"SapI", regexp.MustCompile("GCTCTTC"), regexp.MustCompile("GAAGAGC"), 1, 3, "GCTCTTC"}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@TwFlem TwFlem marked this pull request as ready for review November 4, 2023 12:24
@TwFlem
Copy link
Contributor Author

TwFlem commented Nov 4, 2023

No problem and nothing to be sorry for! I was able to jump right in and reason about.

Also it could be super valuable to optimize seqhash

I'm interested: how so? ...

No clue! I'd have to take a closer look and understand what's going on under the hood. I only say this because that seemed to be a hotspot for Golden Gate. If performance becomes a problem- this might be a good place to start looking.

This makes sense to me. I've always thought an iterative approach would be better.

Yeah, I think this would be worth trying out and measuring.

And ah- so maybe no point in trying to optimize this any further if it might get completely changed up.

@Koeng101
Copy link
Contributor

Koeng101 commented Nov 5, 2023

I only say this because that seemed to be a hotspot for Golden Gate

I just wonder if there is anyway to optimize it (booth rotation algorithm is pretty good). Guessing best way is to just remove the need for it.

And ah- so maybe no point in trying to optimize this any further if it might get completely changed up.

Yes, but this solves a race condition, so definitely worth a merge

Last thing: Can you add to the changelog? Then we can go for a merge. Thanks!

}
}

// If, on a linear sequence, the last overhang's position + EnzymeSkip + EnzymeOverhangLen is over the length of the sequence, remove that overhang.
for _, overhangSet := range [][]Overhang{forwardOverhangs, reverseOverhangs} {
if len(overhangSet) > 0 {
if !seq.Circular && (overhangSet[len(overhangSet)-1].Position+enzyme.Skip+enzyme.OverhangLen > len(sequence)) {
if !part.Circular && (overhangSet[len(overhangSet)-1].Position+enzyme.Skip+enzyme.OverheadLength > len(sequence)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could there be some explanation of the logic behind this line and what it's doing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is if the recognition sequence is at the very end of a circular DNA, and then cutting on the other side (ie, the beginning of the sequence).

t.Errorf("Failure of GoldenGate on incorrect enzyme should follow the exact string `Enzyme EcoRFake not found in enzymeMap`. Got: %s", err.Error())
}
}

func ExampleGoldenGate() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TwFlem we should have an example test for GoldenGate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it to 'ExampleEnzymeManager_GoldenGate' for the linter- it's still there!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well- now it is back to ExampleGoldenGate :)

clone/clone.go Outdated
@@ -333,14 +320,24 @@ Specific cloning functions begin here.

// GoldenGate simulates a GoldenGate cloning reaction. As of right now, we only
// support BsaI, BbsI, BtgZI, and BsmBI.
func GoldenGate(sequences []Part, enzymeStr string) ([]string, []string, error) {
func (enzymeManager EnzymeManager) GoldenGate(sequences []Part, enzymeStr string) (openConstructs []string, infiniteLoops []string, err error) {
Copy link
Collaborator

@TimothyStiles TimothyStiles Nov 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should GoldenGate really be a method of EnzymeManager rather than its own function?

Copy link
Contributor Author

@TwFlem TwFlem Nov 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After learning more about how Golden Gate works IRL and a little bit more about SynBio- probably not.

It seems more appropriate for Golden Gate to just accept something like a 'cuttingEnzyme' of type 'Enzyme' and have the caller lookup whatever enzyme they want however they want.

This is also nice because now Golden Gate doesn't have to return an error anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally would like the change of GoldenGate to be its own function, independent of an enzymeManager.

Copy link
Contributor Author

@TwFlem TwFlem Nov 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to no longer be a receiver

@TimothyStiles TimothyStiles merged commit e96a63e into bebop:main Nov 15, 2023
3 checks passed
@TwFlem TwFlem deleted the cloning-refactor branch November 16, 2023 05:32
@TwFlem TwFlem mentioned this pull request Dec 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data race causes flaky test in "clone"
3 participants